Spatially Selective Deep Non-linear Filters for Speaker Extraction
In a scenario with multiple people talking simultaneously, the spatial
characteristics of the signals are the most distinctive feature for extracting the
target signal. In this work, we develop a deep joint spatial-spectral
non-linear filter that can be steered in an arbitrary target direction. To
this end, we propose a simple and effective conditioning mechanism that sets the
initial state of the filter's recurrent layers based on the target direction.
We show that this scheme is more effective than the baseline approach and
increases the flexibility of the filter at no performance cost. The resulting
spatially selective non-linear filters can also be used for speech separation
of an arbitrary number of speakers and enable very accurate multi-speaker
localization, as we demonstrate in this paper.
Comment: Submitted to ICASSP 202
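As a rough illustration of such a conditioning mechanism (an assumed architecture, not necessarily the authors' exact model), the sketch below initializes an LSTM's recurrent state from a learned embedding of the target direction; all module names and sizes are hypothetical.

```python
# Minimal sketch (assumed architecture): a spatially selective filter whose
# recurrent state is initialized from an embedding of the target direction.
import torch
import torch.nn as nn

class DirectionConditionedFilter(nn.Module):
    def __init__(self, n_feats=257, hidden=256, n_dirs=36):
        super().__init__()
        # Embed the (discretized) target direction.
        self.dir_embed = nn.Embedding(n_dirs, hidden)
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_feats)

    def forward(self, spec, direction):
        # spec: (batch, time, n_feats) spectral features
        # direction: (batch,) index of the target direction
        e = self.dir_embed(direction)               # (batch, hidden)
        h0 = e.unsqueeze(0)                         # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(spec, (h0, c0))          # steered via initial state
        return torch.sigmoid(self.mask_head(out))   # per-bin extraction mask
```

Setting only the initial recurrent state leaves the per-frame computation unchanged, which is one way such a scheme could add steerability at no extra runtime cost.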
Uncertainty Estimation in Deep Speech Enhancement Using Complex Gaussian Mixture Models
Single-channel deep speech enhancement approaches often estimate a single
multiplicative mask to extract clean speech without a measure of its accuracy.
Instead, in this work, we propose to quantify the uncertainty associated with
clean speech estimates in neural network-based speech enhancement. Predictive
uncertainty is typically categorized into aleatoric uncertainty and epistemic
uncertainty. The former accounts for the inherent uncertainty in the data,
while the latter corresponds to model uncertainty. Aiming for robust clean speech
estimation and efficient predictive uncertainty quantification, we propose to
integrate statistical complex Gaussian mixture models (CGMMs) into a deep
speech enhancement framework. More specifically, we model the dependency
between input and output stochastically by means of a conditional probability
density and train a neural network to map the noisy input to the full posterior
distribution of clean speech, modeled as a mixture of multiple complex Gaussian
components. Experimental results on different datasets show that the proposed
algorithm effectively captures predictive uncertainty and that combining
powerful statistical models with deep learning also delivers superior speech
enhancement performance.
Comment: 5 pages, 4 figures
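To make the CGMM parameterization concrete, here is a minimal sketch of an output head that maps network features to mixture parameters per time-frequency bin; the component count, layer sizes, and the particular uncertainty expression are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch: a head predicting, per TF bin, the parameters of a
# K-component complex Gaussian mixture posterior over clean speech.
import torch
import torch.nn as nn

class CGMMHead(nn.Module):
    def __init__(self, feat_dim=256, K=3):
        super().__init__()
        self.K = K
        self.weights = nn.Linear(feat_dim, K)       # mixture logits
        self.means = nn.Linear(feat_dim, 2 * K)     # real/imag mean parts
        self.logvar = nn.Linear(feat_dim, K)        # per-component variance

    def forward(self, h):
        # h: (..., feat_dim) features for a batch of TF bins
        pi = torch.softmax(self.weights(h), dim=-1)             # (..., K)
        mu = self.means(h).reshape(h.shape[:-1] + (self.K, 2))  # (..., K, 2)
        mu = torch.view_as_complex(mu.contiguous())             # (..., K)
        var = self.logvar(h).exp()                              # (..., K)
        # Posterior mean as the clean-speech estimate ...
        mean = (pi * mu).sum(-1)
        # ... and the mixture's total variance as an uncertainty measure:
        # expected component variance plus spread of component means.
        spread = (mu - mean.unsqueeze(-1)).abs() ** 2
        uncertainty = (pi * (var + spread)).sum(-1)
        return mean, uncertainty
```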
DiffPhase: Generative Diffusion-based STFT Phase Retrieval
Diffusion probabilistic models have recently been used in a variety of tasks,
including speech enhancement and synthesis. As a generative approach, diffusion
models have been shown to be especially suitable for imputation problems, where
missing data is generated based on existing data. Phase retrieval is inherently
an imputation problem, where phase information has to be generated based on the
given magnitude. In this work, we build on previous work in the speech domain,
adapting a speech enhancement diffusion model specifically for STFT phase
retrieval. Evaluation using speech quality and intelligibility metrics shows
that the diffusion approach is well suited to the phase retrieval task, with
performance surpassing both classical and modern methods.
Comment: Submitted to ICASSP 202
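The imputation view can be illustrated with a toy reverse-diffusion loop that re-imposes the known STFT magnitude on the running complex estimate after every denoising step; the `denoiser` interface, initialization, and step count below are placeholders, not the DiffPhase model itself.

```python
# Toy illustration (not DiffPhase itself): phase retrieval as imputation.
# `denoiser(x, t)` stands in for one step of a trained reverse-diffusion model.
import numpy as np

def retrieve_phase(magnitude, denoiser, n_steps=50, rng=None):
    rng = rng or np.random.default_rng(0)
    # Start from the given magnitude with random phase.
    phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
    x = magnitude * np.exp(1j * phase)
    for t in reversed(range(n_steps)):
        x = denoiser(x, t)                       # one reverse-diffusion step
        # Imputation constraint: keep the known magnitude, keep the
        # generated phase (project onto the known-magnitude manifold).
        x = magnitude * np.exp(1j * np.angle(x))
    return x  # complex STFT with retrieved phase
```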
Audio-Visual Speech Enhancement with Score-Based Generative Models
This paper introduces an audio-visual speech enhancement system that
leverages score-based generative models, also known as diffusion models,
conditioned on visual information. In particular, we exploit audio-visual
embeddings obtained from a self-supervised learning model that has been
fine-tuned on lipreading. The layer-wise features of its transformer-based
encoder are aggregated, time-aligned, and incorporated into the noise
conditional score network. Experimental evaluations show that the proposed
audio-visual speech enhancement system yields improved speech quality and
reduces generative artifacts such as phonetic confusions with respect to the
audio-only equivalent. The latter is supported by the word error rate of a
downstream automatic speech recognition model, which decreases noticeably,
especially at low input signal-to-noise ratios.
Comment: Submitted to ITG Conference on Speech Communication
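As a hedged sketch of the conditioning pipeline described above, the module below aggregates layer-wise encoder features with learned per-layer weights and time-aligns them to the score network's frame rate by interpolation; all names and dimensions are assumptions.

```python
# Hedged sketch: combine layer-wise features from a (frozen) audio-visual
# encoder and resample them to the score network's frame count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerAggregator(nn.Module):
    def __init__(self, n_layers=12):
        super().__init__()
        # One learnable scalar weight per encoder layer.
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_feats, n_frames):
        # layer_feats: (n_layers, batch, time, dim) transformer hidden states
        w = torch.softmax(self.layer_logits, dim=0)
        agg = (w.view(-1, 1, 1, 1) * layer_feats).sum(0)  # (batch, time, dim)
        # Time-align to the spectrogram frame count by linear interpolation.
        agg = F.interpolate(agg.transpose(1, 2), size=n_frames,
                            mode="linear", align_corners=False)
        return agg.transpose(1, 2)                        # (batch, n_frames, dim)
```

The aggregated features could then be fed to the noise conditional score network as an additional conditioning input alongside the noisy spectrogram.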
Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes
The SepFormer architecture shows very good results in speech separation. Like
other learned-encoder models, it uses short frames, which have been shown to
yield better performance for such models. This results in a large number of
frames at the input, which is problematic: since the SepFormer is
transformer-based, its computational complexity increases drastically with
longer sequences. In this paper, we employ the SepFormer in a speech
enhancement task and show that by replacing the learned-encoder features with a
magnitude short-time Fourier transform (STFT) representation, we can use long
frames without compromising perceptual enhancement performance. We obtained
equivalent quality and intelligibility evaluation scores while reducing the
number of operations by a factor of approximately 8 for a 10-second utterance.
Comment: Accepted at Interspeech 202
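A back-of-the-envelope calculation illustrates why longer frames matter for a transformer; the sampling rate and hop sizes below are plausible examples, not the paper's exact configuration.

```python
# Illustrative only: sequence length vs. self-attention cost for a 10 s
# utterance at 16 kHz, comparing a short learned-encoder stride with a
# long-frame STFT hop (both values assumed, not from the paper).
fs, seconds = 16000, 10
hops = {"learned encoder": 8, "STFT (long frames)": 128}  # hop in samples

for name, hop in hops.items():
    n = fs * seconds // hop
    print(f"{name:>20}: {n:6d} frames, attention cost ~ {n**2:.2e}")

# Self-attention scales quadratically with the number of frames, so a
# longer hop shrinks that dominant term by (hop_long / hop_short)^2;
# the overall operation count also depends on the non-attention layers.
```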